This is V2 of this sheet. I froze a save of the first one because I believe its data pre-processing methodology had some problems. But not to worry: V2 is built on the same bones, so it should fix those issues and produce a better model.
So this is basically me playing with one question: what can you guess from this data?
You can get the data here: https://www.ons.gov.uk/census/maps/choropleth/, or in bulk (and in a better format) here: https://www.nomisweb.co.uk/sources/census_2021_bulk.
The Election data is available here:
Obviously, that doesn't narrow down what I'm actually talking about, so let's be more specific about some of the questions:
"Could you guess how an area voted in the 2024 general election?" is, to be honest, the most interesting question. Let's work with this.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, QuantileTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error, r2_score
from imblearn.over_sampling import SMOTE
import os
current_dir = os.getcwd()
census_data_dir = (current_dir + '/DataSet/Census Data/')
election_data_dir = (current_dir + '/DataSet/Election Data/')
mapping_data_dir = (current_dir + '/DataSet/Mapping/')
# This is the random state variable that will be used throughout the code.
random_code = 42
Before I go any further, it is worth exploring what each dataset suffix refers to.
Regarding the election data, though, we have one problem: the census doesn't record anything within electoral boundaries. None of the geographies listed above have anything to do with electoral areas, so we somehow need to map one of them onto the electoral map. We have some solutions:
One more thing: all the column headers are in the first row of each CSV, so I'll need to extract the definitions from there and keep whatever is usable. I downloaded all the data, not all of which will be useful to me, so I may have given myself a lot of cleaning up to do.
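To pull those definitions without loading whole tables, pandas can parse just the header row. A small sketch (`peek_headers` is my own helper name, not from this notebook):

```python
import pandas as pd
from io import StringIO

def peek_headers(path_or_buffer):
    """Return only the column names of a CSV, without loading the data rows."""
    # nrows=0 parses the header line and stops, so this is cheap even
    # for the large OA-level files.
    return pd.read_csv(path_or_buffer, nrows=0).columns.tolist()

# Works the same on a real file path or an in-memory buffer:
demo = StringIO("geography code,geography,Total: All usual residents\nE02000001,City of London 001,8600\n")
print(peek_headers(demo))
```

In the notebook this would be called with `os.path.join(census_data_dir, file)` for each downloaded CSV.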
df = pd.read_csv(os.path.join(census_data_dir, 'census2021-ts002-rgn.csv'))
df.head()
The election data I'll keep as two separate tables for the two separate elections; it makes the analysis a bit easier for me.
GE2019_df = pd.read_csv(os.path.join(election_data_dir, 'HOC-GE2019-results-by-constituency.csv'))
GE2019_df.dropna(axis=1,inplace=True)
GE2019_df.drop(columns=['ONS region ID','Region name','Country name','Constituency type','Member first name','Member surname','Member gender',"Result","Second party", "Majority","APNI", "UUP", "SDLP", "SF", "DUP", "SNP","Of which other winner"], inplace=True)
GE2019_df.rename(columns={'ONS ID':'PCON25CD','Constituency name':'PCON25NM','First party':'Elected','All other candidates':'Ind'}, inplace=True)
GE2024_df = pd.read_csv(os.path.join(election_data_dir, 'HOC-GE2024-results-by-constituency.csv'))
GE2024_df.dropna(axis=1,inplace=True)
GE2024_df.drop(columns=['ONS region ID','Region name','Country name','Constituency type','Member first name','Member surname','Member gender',"Result","Second party", "Majority","APNI", "UUP", "SDLP", "SF", "DUP", "SNP","Of which other winner"], inplace=True)
GE2024_df.rename(columns={'ONS ID':'PCON25CD','Constituency name':'PCON25NM', 'First party':'Elected','All other candidates':'Ind'}, inplace=True)
GE2024_df.head()
Now, I do think it's worth keeping the raw numbers, but I also want them in a better format: share of the vote. So I'm going to create two new data frames holding the numbers as fractions.
GE2019_frac_df = GE2019_df.copy()
GE2019_frac_df.drop(columns=['Electorate','Valid votes', 'Invalid votes'], inplace=True)
# frac columns [3:] line up with raw columns [6:] only because exactly three
# columns (Electorate, Valid votes, Invalid votes) were dropped above.
GE2019_frac_df[GE2019_frac_df.columns[3:]] = GE2019_df[GE2019_df.columns[6:]].div(GE2019_df[GE2019_df.columns[6:]].sum(axis=1), axis=0)
GE2024_frac_df = GE2024_df.copy()
GE2024_frac_df.drop(columns=['Electorate','Valid votes', 'Invalid votes'], inplace=True)
# Same positional alignment as for 2019: frac columns [3:] match raw columns [6:].
GE2024_frac_df[GE2024_frac_df.columns[3:]] = GE2024_df[GE2024_df.columns[6:]].div(GE2024_df[GE2024_df.columns[6:]].sum(axis=1), axis=0)
GE2024_frac_df.head()
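As an aside, a name-based version of that normalisation is less fragile than the positional slicing, since it doesn't depend on exactly which columns were dropped. A sketch (the helper name and toy frame are mine):

```python
import pandas as pd

def to_vote_shares(df, id_cols=('PCON25CD', 'PCON25NM', 'Elected')):
    """Convert raw vote counts to row-wise shares, keyed by column name."""
    out = df.copy()
    party_cols = [c for c in out.columns if c not in id_cols]
    # Divide each party column by the row total across party columns only.
    out[party_cols] = out[party_cols].div(out[party_cols].sum(axis=1), axis=0)
    return out

toy = pd.DataFrame({'PCON25CD': ['X1'], 'PCON25NM': ['Toyshire'], 'Elected': ['Lab'],
                    'Lab': [600], 'Con': [300], 'LD': [100]})
print(to_vote_shares(toy)[['Lab', 'Con', 'LD']].iloc[0].tolist())
```

Selecting by name this way would keep working even if the ID columns were reordered.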
Finally, let's get the mapping data and make one unified mapping data frame. We'll preserve the names too, since they keep things human-readable.
MSOA2021_to_PCON2024_df = pd.read_csv(os.path.join(mapping_data_dir, 'MSOA_(2021)_to_future_Parliamentary_Constituencies_Lookup_in_England_and_Wales.csv'))
OA2021_to_PCON2024_df = pd.read_csv(os.path.join(mapping_data_dir, 'Output_area_(2021)_to_future_Parliamentary_Constituencies_Lookup_in_England_and_Wales.csv'))
MSOA2021_to_PCON2024_df.dropna(axis=1, inplace=True)
OA2021_to_PCON2024_df.dropna(axis=1, inplace=True)
MSOA2021_to_PCON2024_df.head()
OA2021_to_PCON2024_df.head()
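Before trusting these lookups, it's worth confirming the mapping really is many-to-one (each MSOA sits in exactly one constituency). pandas can assert this during the merge; a toy sketch with made-up codes:

```python
import pandas as pd

msoa = pd.DataFrame({'MSOA21CD': ['E02000001', 'E02000002']})
lookup = pd.DataFrame({'MSOA21CD': ['E02000001', 'E02000002'],
                       'PCON25CD': ['E14001172', 'E14001172']})

# validate='many_to_one' raises pandas.errors.MergeError if any MSOA21CD
# appears more than once in the lookup, i.e. if the mapping is ambiguous.
merged = msoa.merge(lookup, on='MSOA21CD', how='left', validate='many_to_one')
print(merged['PCON25CD'].tolist())
```

The same `validate` argument could be added to the real merges later without changing their results.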
csv_names = []
csv_names_oa = []
csv_names_msoa = []
set_names = []
set_names_oa = []
set_names_msoa = []
files = os.listdir(census_data_dir)
for file in files:
    if '.csv' in file:  # We only care about the CSVs
        csv_names.append(file)
        set_names.append(file[11:16])
        if '-msoa' in file:
            csv_names_msoa.append(file)
            set_names_msoa.append(file[11:16])
        elif '-oa' in file:
            csv_names_oa.append(file)
            set_names_oa.append(file[11:16])
        else:
            pass
# Summary stuff
print("There are %d unique data sets in total" %(len(set(set_names))))
print("%d contain 'OA' in their geography" %(len(set(csv_names_oa))))
print("%d contain 'MSOA' in their geography" %(len(set(csv_names_msoa))))
# Get the missing data sets
missing_oa = set(set_names) - set(set_names_oa)
missing_msoa = set(set_names) - set(set_names_msoa)
missing_both = missing_msoa & missing_oa
print("The missing data sets for the OA areas are: " + str(missing_oa))
print("The missing data sets for the MSOA areas are: " + str(missing_msoa))
print("The data sets missing both the OA and MSOA areas are: " + str(missing_both))
Kind of annoying that we have some useless sets then, but it is what it is. Checking manually, the following are unusable:
What we can see, though, is that MSOA has all the useful codes, so that's the list we should use. I'll do most of my analysis at MSOA level anyway; thinking about it, OA is probably too granular and I might overfit.
#filtered_files = [file for file in files if any(name in file for name in set_names_msoa) and ('-oa' in file or '-msoa' in file)]
filtered_files_msoa = [file for file in files if any(name in file for name in set_names_msoa) and '-msoa' in file]
filtered_files_oa = [file for file in files if any(name in file for name in set_names_msoa) and '-oa' in file]
print(filtered_files_msoa)
print(filtered_files_oa)
"""# Let's put them into a dictionary to reference
filtered_dict = {}
for file in filtered_files:
    dataset_name = file[11:16]
    if dataset_name not in filtered_dict:
        filtered_dict[dataset_name] = []
    filtered_dict[dataset_name].append(file)
print(filtered_dict)"""
We'll group these into a few different categories. The idea is that rather than one big model (as we attempted last time), we have a number of 'areas of concern' to group our census data into. We can then use these like a voting system, where each model votes on how it thinks the constituency went. We'll also drop some of the verbose data sets, which simply have too many columns for us to use.
Our groups will be:
Unclaimed:
natpop_census_categories = ["TS001", "TS002", "TS004", "TS010", "TS019", "TS025", "TS029", "TS032"]
natpop_census_categories = [element.lower() for element in natpop_census_categories]
spirh_census_categories = ["TS008","TS021","TS027","TS030","TS037","TS038","TS075","TS077","TS078"]
spirh_census_categories = [element.lower() for element in spirh_census_categories]
housing_census_categories = ["TS003","TS017","TS044","TS045","TS046","TS050","TS051","TS052","TS053","TS054","TS056"]
housing_census_categories = [element.lower() for element in housing_census_categories]
eeo_census_categories = ["TS011","TS039","TS059","TS060","TS061","TS062","TS063","TS066","TS067","TS068","TS071"]
eeo_census_categories = [element.lower() for element in eeo_census_categories]
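As a sketch of the voting idea (my own toy illustration, not this notebook's final model): scikit-learn's `VotingClassifier`, already imported above, combines one estimator per group and takes a majority vote. Random features stand in for the census groups here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 12))    # stand-in for the concatenated census features
y = rng.integers(0, 2, size=200)  # stand-in for the winning-party label

# One estimator per 'area of concern'; voting='hard' means majority rule.
ensemble = VotingClassifier(
    estimators=[(name, RandomForestClassifier(n_estimators=25, random_state=42))
                for name in ('natpop', 'spirh', 'housing', 'eeo')],
    voting='hard',
)
scores = cross_val_score(ensemble, X, y, cv=3)
print(scores.shape)
```

One caveat: `VotingClassifier` feeds the same `X` to every estimator, so to give each group model only its own census columns, each estimator would need to be wrapped in a `Pipeline` with a column selector.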
We'll also need to do a bit of finessing on each data set, as some have annoying things like total columns...
census_natpop_pcon_df = None
census_spirh_pcon_df = None
census_housing_pcon_df = None
census_eeo_pcon_df = None
for file in filtered_files_msoa:  # Loop through our files
    name = file[11:16]
    placeholder_df = pd.read_csv(os.path.join(census_data_dir, file))
    placeholder_df.drop(columns=['date'], inplace=True)
    placeholder_df.rename(columns={'geography code': 'MSOA21CD', 'geography': 'MSOA21NM'}, inplace=True)
    columns_to_drop = [col for col in placeholder_df.columns if 'total' in col.lower()]
    placeholder_df.drop(columns=columns_to_drop, inplace=True)
    placeholder_df = placeholder_df.merge(MSOA2021_to_PCON2024_df, on='MSOA21CD', how='left', copy=False)
    placeholder_df.drop(columns=['MSOA21NM_x','MSOA21NM_y','MSOA21CD','LAD21CD','LAD21NM','ObjectId','PCON25NM'], inplace=True)
    # Just move the geography column to the 'front' of the dataframe
    placeholder_df = placeholder_df[['PCON25CD'] + [col for col in placeholder_df.columns if col != 'PCON25CD']]
    # Now we want one row = one constituency
    placeholder_df = placeholder_df.groupby(['PCON25CD'], as_index=False).aggregate('sum').reindex(columns=placeholder_df.columns)
    placeholder_df_numeric = placeholder_df.select_dtypes(include=[np.number])
    # Normalise by the numeric row total only; summing the whole frame would
    # also try to add the PCON25CD strings.
    placeholder_df[placeholder_df_numeric.columns] = placeholder_df_numeric.div(placeholder_df_numeric.sum(axis=1), axis=0)
    if name in natpop_census_categories:
        if census_natpop_pcon_df is None:
            census_natpop_pcon_df = placeholder_df
        else:
            census_natpop_pcon_df = census_natpop_pcon_df.merge(placeholder_df, on='PCON25CD', how='left', copy=False)
    elif name in spirh_census_categories:
        if census_spirh_pcon_df is None:
            census_spirh_pcon_df = placeholder_df
        else:
            census_spirh_pcon_df = census_spirh_pcon_df.merge(placeholder_df, on='PCON25CD', how='left', copy=False)
    elif name in housing_census_categories:
        if census_housing_pcon_df is None:
            census_housing_pcon_df = placeholder_df
        else:
            census_housing_pcon_df = census_housing_pcon_df.merge(placeholder_df, on='PCON25CD', how='left', copy=False)
    elif name in eeo_census_categories:
        if census_eeo_pcon_df is None:
            census_eeo_pcon_df = placeholder_df
        else:
            census_eeo_pcon_df = census_eeo_pcon_df.merge(placeholder_df, on='PCON25CD', how='left', copy=False)
    else:
        print("Uncategorised: " + str(name))
print("Total number of constituencies in set: %d" %census_eeo_pcon_df.shape[0])
Wikipedia: https://en.wikipedia.org/wiki/Constituencies_of_the_Parliament_of_the_United_Kingdom
"As of the 2024 election there are 543 constituencies in England, 32 in Wales, 57 in Scotland and 18 in Northern Ireland." Meaning there should be 543 + 32 = 575 constituencies in our data set, since the census files cover England and Wales only.
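That count can be sanity-checked along these lines (toy codes below; in the notebook this would be run against `census_eeo_pcon_df['PCON25CD']`):

```python
def check_constituencies(codes, expected=543 + 32):
    """Every code should be English ('E...') or Welsh ('W...'),
    and there should be exactly 575 distinct constituencies."""
    assert all(c[0] in ('E', 'W') for c in codes), "non-England/Wales code found"
    assert len(set(codes)) == expected, "expected %d, got %d" % (expected, len(set(codes)))
    return True

# Toy demonstration with fabricated codes of roughly the right shape:
toy_codes = ["E140%05d" % i for i in range(543)] + ["W070%05d" % i for i in range(32)]
print(check_constituencies(toy_codes))
```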
print("NatPop: Shape of the final dataframe: " + str(census_natpop_pcon_df.shape))
print("SPIRH: Shape of the final dataframe: " + str(census_spirh_pcon_df.shape))
print("Housing: Shape of the final dataframe: " + str(census_housing_pcon_df.shape))
print("EEO: Shape of the final dataframe: " + str(census_eeo_pcon_df.shape))
for label, frame in [('NatPop', census_natpop_pcon_df), ('SPIRH', census_spirh_pcon_df),
                     ('Housing', census_housing_pcon_df), ('EEO', census_eeo_pcon_df)]:
    if len(frame.columns) == len(set(frame.columns)):
        print(label + ": No duplicates")
    else:
        print(label + ": Duplicates found")
As we can see, all the missing values are the Welsh-language ability columns. That makes sense: only the census in Wales records Welsh ability; England doesn't record it. It's not ideal, but I will put '0' for the entirety of England.
census_natpop_pcon_df.fillna(0, inplace=True)
for label, frame in [('NatPop', census_natpop_pcon_df), ('SPIRH', census_spirh_pcon_df),
                     ('Housing', census_housing_pcon_df), ('EEO', census_eeo_pcon_df)]:
    missing = frame.isna().any()[lambda x: x]
    if len(missing) == 0:
        print(label + ": No missing values")
    else:
        print(label + ": Missing values found")
        print(missing)
Now let's put the election and census data together.
census_natpop_GE2024_df = census_natpop_pcon_df.merge(GE2024_frac_df, on='PCON25CD', how='inner', copy=False)
census_spirh_GE2024_df = census_spirh_pcon_df.merge(GE2024_frac_df, on='PCON25CD', how='inner', copy=False)
census_housing_GE2024_df = census_housing_pcon_df.merge(GE2024_frac_df, on='PCON25CD', how='inner', copy=False)
census_eeo_GE2024_df = census_eeo_pcon_df.merge(GE2024_frac_df, on='PCON25CD', how='inner', copy=False)
census_eeo_GE2024_df.head()
Now let's see if we can pick out anything interesting from this information. There is a lot to look at in this data, so we need to make sure it stays usable.
census_natpop_GE2024_df_numeric = census_natpop_GE2024_df.select_dtypes(include=['int64', 'float64'])
census_housing_GE2024_df_numeric = census_housing_GE2024_df.select_dtypes(include=['int64', 'float64'])
census_eeo_GE2024_df_numeric = census_eeo_GE2024_df.select_dtypes(include=['int64', 'float64'])
census_spirh_GE2024_df_numeric = census_spirh_GE2024_df.select_dtypes(include=['int64', 'float64'])
corr_natpop = census_natpop_GE2024_df_numeric.corr()
corr_housing = census_housing_GE2024_df_numeric.corr()
corr_eeo = census_eeo_GE2024_df_numeric.corr()
corr_spirh = census_spirh_GE2024_df_numeric.corr()
This is a cool plot, but far too dense for us to see anything in; we'll need to filter it down a bit.
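The plotting cells that follow all repeat the same selection step: pick the k features most (or least) correlated with a party, then plot the correlation matrix among just those features. That step could be factored into a small helper along these lines (the name and signature are mine):

```python
import numpy as np
import pandas as pd

def top_k_corr_matrix(corr_df, numeric_df, target, k=10, largest=True):
    """Return the k columns most (or least) correlated with `target`,
    plus the correlation matrix among just those columns."""
    picker = corr_df.nlargest if largest else corr_df.nsmallest
    cols = picker(k, target)[target].index
    return cols, np.corrcoef(numeric_df[cols].values.T)

# Toy demonstration on random data with a 'Lab' column:
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=['Lab', 'a', 'b', 'c'])
cols, cm = top_k_corr_matrix(toy.corr(), toy, 'Lab', k=3)
print(cols[0], cm.shape)
```

For example, `cols_natpop, cm_n = top_k_corr_matrix(corr_natpop, census_natpop_GE2024_df_numeric, 'Lab')` would replace the first two lines of each block below.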
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'Lab')['Lab'].index
cols_eeo = corr_eeo.nlargest(k, 'Lab')['Lab'].index
cols_housing = corr_housing.nlargest(k, 'Lab')['Lab'].index
cols_spirh = corr_spirh.nlargest(k, 'Lab')['Lab'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Labour - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Labour - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Labour - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Labour - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'Lab')['Lab'].index
cols_eeo = corr_eeo.nsmallest(k, 'Lab')['Lab'].index
cols_housing = corr_housing.nsmallest(k, 'Lab')['Lab'].index
cols_spirh = corr_spirh.nsmallest(k, 'Lab')['Lab'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Labour - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Labour - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Labour - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Labour - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'Con')['Con'].index
cols_eeo = corr_eeo.nlargest(k, 'Con')['Con'].index
cols_housing = corr_housing.nlargest(k, 'Con')['Con'].index
cols_spirh = corr_spirh.nlargest(k, 'Con')['Con'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Conservative - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Conservative - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Conservative - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Conservative - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'Con')['Con'].index
cols_eeo = corr_eeo.nsmallest(k, 'Con')['Con'].index
cols_housing = corr_housing.nsmallest(k, 'Con')['Con'].index
cols_spirh = corr_spirh.nsmallest(k, 'Con')['Con'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Conservative - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Conservative - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Conservative - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Conservative - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'LD')['LD'].index
cols_eeo = corr_eeo.nlargest(k, 'LD')['LD'].index
cols_housing = corr_housing.nlargest(k, 'LD')['LD'].index
cols_spirh = corr_spirh.nlargest(k, 'LD')['LD'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Liberal Democrat - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Liberal Democrat - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Liberal Democrat - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Liberal Democrat - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'LD')['LD'].index
cols_eeo = corr_eeo.nsmallest(k, 'LD')['LD'].index
cols_housing = corr_housing.nsmallest(k, 'LD')['LD'].index
cols_spirh = corr_spirh.nsmallest(k, 'LD')['LD'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Liberal Democrat - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Liberal Democrat - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Liberal Democrat - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Liberal Democrat - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'RUK')['RUK'].index
cols_eeo = corr_eeo.nlargest(k, 'RUK')['RUK'].index
cols_housing = corr_housing.nlargest(k, 'RUK')['RUK'].index
cols_spirh = corr_spirh.nlargest(k, 'RUK')['RUK'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Reform UK - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Reform UK - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Reform UK - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Reform UK - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'RUK')['RUK'].index
cols_eeo = corr_eeo.nsmallest(k, 'RUK')['RUK'].index
cols_housing = corr_housing.nsmallest(k, 'RUK')['RUK'].index
cols_spirh = corr_spirh.nsmallest(k, 'RUK')['RUK'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Reform UK - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Reform UK - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Reform UK - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Reform UK - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'Green')['Green'].index
cols_eeo = corr_eeo.nlargest(k, 'Green')['Green'].index
cols_housing = corr_housing.nlargest(k, 'Green')['Green'].index
cols_spirh = corr_spirh.nlargest(k, 'Green')['Green'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Green - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Green - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Green - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Green - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'Green')['Green'].index
cols_eeo = corr_eeo.nsmallest(k, 'Green')['Green'].index
cols_housing = corr_housing.nsmallest(k, 'Green')['Green'].index
cols_spirh = corr_spirh.nsmallest(k, 'Green')['Green'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Green - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Green - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Green - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Green - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'PC')['PC'].index
cols_eeo = corr_eeo.nlargest(k, 'PC')['PC'].index
cols_housing = corr_housing.nlargest(k, 'PC')['PC'].index
cols_spirh = corr_spirh.nlargest(k, 'PC')['PC'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Plaid Cymru - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Plaid Cymru - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Plaid Cymru - N Largest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Plaid Cymru - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'PC')['PC'].index
cols_eeo = corr_eeo.nsmallest(k, 'PC')['PC'].index
cols_housing = corr_housing.nsmallest(k, 'PC')['PC'].index
cols_spirh = corr_spirh.nsmallest(k, 'PC')['PC'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Plaid Cymru - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Plaid Cymru - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Plaid Cymru - N Smallest')
axes[1, 1].set_title('SPIRH Correlation Matrix for if you voted Plaid Cymru - N Smallest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nlargest(k, 'Ind')['Ind'].index
cols_eeo = corr_eeo.nlargest(k, 'Ind')['Ind'].index
cols_housing = corr_housing.nlargest(k, 'Ind')['Ind'].index
cols_spirh = corr_spirh.nlargest(k, 'Ind')['Ind'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Independent - N Largest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Independent - N Largest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Independent - N Largest')
axes[1,1].set_title('SPIRH Correlation Matrix for if you voted Independent - N Largest')
plt.tight_layout()
plt.show()
k = 10 #number of variables for heatmap
cols_natpop = corr_natpop.nsmallest(k, 'Ind')['Ind'].index
cols_eeo = corr_eeo.nsmallest(k, 'Ind')['Ind'].index
cols_housing = corr_housing.nsmallest(k, 'Ind')['Ind'].index
cols_spirh = corr_spirh.nsmallest(k, 'Ind')['Ind'].index
cm_n = np.corrcoef(census_natpop_GE2024_df_numeric[cols_natpop].values.T)
cm_e = np.corrcoef(census_eeo_GE2024_df_numeric[cols_eeo].values.T)
cm_h = np.corrcoef(census_housing_GE2024_df_numeric[cols_housing].values.T)
cm_s = np.corrcoef(census_spirh_GE2024_df_numeric[cols_spirh].values.T)
fig, axes = plt.subplots(2, 2, figsize=(20, 15))
sns.set(font_scale=1.25)
hm_n = sns.heatmap(cm_n, ax=axes[0, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_natpop.values, xticklabels=cols_natpop.values)
hm_n.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_h = sns.heatmap(cm_h, ax=axes[0, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_housing.values, xticklabels=cols_housing.values)
hm_h.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_h.invert_xaxis()
hm_e = sns.heatmap(cm_e, ax=axes[1, 0], cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_eeo.values, xticklabels=cols_eeo.values)
hm_e.tick_params(left=True, bottom=False, labelleft=True, labelbottom=False, right=False, top=False, labelright=False, labeltop=False, labelrotation=0)
hm_s = sns.heatmap(cm_s, ax=axes[1, 1], cbar=False, annot=True, square=True, fmt='.2f', annot_kws={'size':10}, yticklabels=cols_spirh.values, xticklabels=cols_spirh.values)
hm_s.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False, right=True, top=False, labelright=True, labeltop=False, labelrotation=0)
hm_s.invert_xaxis()
axes[0, 0].set_title('NatPop Correlation Matrix for if you voted Independent - N Smallest')
axes[0, 1].set_title('Housing Correlation Matrix for if you voted Independent - N Smallest')
axes[1, 0].set_title('EEO Correlation Matrix for if you voted Independent - N Smallest')
axes[1,1].set_title('SPIRH Correlation Matrix for if you voted Independent - N Smallest')
plt.tight_layout()
plt.show()
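Those eight blocks above differ only in the party column and whether we take the largest or smallest correlations, so the duplication can be folded into a helper. A minimal sketch (the `top_corr_matrix` helper and the small `demo` frame are hypothetical stand-ins for the real `corr_*` and `census_*_GE2024_df_numeric` frames):

```python
import numpy as np
import pandas as pd

def top_corr_matrix(numeric_df, corr_df, party, k=10, largest=True):
    """Pick the k columns most (or least) correlated with `party`
    and return them with their mutual correlation matrix."""
    picker = corr_df.nlargest if largest else corr_df.nsmallest
    cols = picker(k, party)[party].index
    cm = np.corrcoef(numeric_df[cols].values.T)
    return cols, cm

# Synthetic stand-in for one census table (hypothetical column names).
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(50, 5)),
                    columns=['RUK', 'a', 'b', 'c', 'd'])
corr_demo = demo.corr()

cols, cm = top_corr_matrix(demo, corr_demo, 'RUK', k=3, largest=True)
print(cols.tolist())
print(cm.shape)
```

Each figure then becomes one call to this helper per census table, followed by a single `sns.heatmap` call, which also makes it much harder to mix up a matrix with the wrong tick labels.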
Let's do some more plots to glean some more info.
A pairplot could be handy to see how much voters bleed into each other: how committed in their votes are Labour voters compared to Plaid voters? The way I interpret these is that the x axis is the 'likelihood' of voting for that party and the y axis is the 'shared probability' of voting for another party.
sns.set()
cols = ['Con', 'Lab', 'RUK', 'LD', 'PC', 'Green', 'Ind']
pp = sns.pairplot(census_natpop_GE2024_df[cols], kind='reg', plot_kws={'line_kws': {'color': 'red'}})
plt.show()
I don't think this plot is very useful, but it's cool to look at. I don't think the next plot is very useful either, but it is interesting: what share of the vote did a party win a constituency with? This basically tells us how popular the party was; the higher the mean, the more liked they were in the constituency.
lab = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'Lab']
con = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'Con']
ld = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'LD']
ruk = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'RUK']
green = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'Green']
pc = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'PC']
ind = census_natpop_GE2024_df[census_natpop_GE2024_df['Elected'] == 'Ind']
n_bins = 20
ax = sns.histplot(lab["Lab"], bins=n_bins, label = "Labour", color='r')
ax = sns.histplot(con["Con"], bins=n_bins, label = "Conservative", color='b')
ax = sns.histplot(ld["LD"], bins=n_bins, label = "Liberal Democrat", color='y')
ax = sns.histplot(ruk["RUK"], bins=n_bins, label = "Reform UK", color='c')
ax = sns.histplot(green["Green"], bins=n_bins, label = "Green", color='lime')
ax = sns.histplot(pc["PC"], bins=n_bins, label = "Plaid Cymru", color='g')
ax = sns.histplot(ind["Ind"], bins=n_bins, label = "Independent", color='pink')
ax.legend()
ax.set(xlabel='Vote Share', ylabel='Frequency')
ax.set_title('Vote share for the winning party of a constituency (all parties)')
n_bins = 5
ax = sns.histplot(ruk["RUK"], bins=n_bins, label = "Reform UK", color='c')
ax = sns.histplot(green["Green"], bins=n_bins, label = "Green", color='lime')
ax = sns.histplot(pc["PC"], bins=n_bins, label = "Plaid Cymru", color='g')
ax = sns.histplot(ind["Ind"], bins=n_bins, label = "Independent", color='pink')
ax.legend()
ax.set(xlabel='Vote Share', ylabel='Frequency')
ax.set_title('Vote share for the winning party of a constituency (smaller parties)')
# Wrap each describe() in print() so every summary is shown, not just the last one
print("Labour - Description")
print(lab['Lab'].describe())
print("Conservative - Description")
print(con['Con'].describe())
print("Liberal Democrat - Description")
print(ld['LD'].describe())
print("Reform UK - Description")
print(ruk['RUK'].describe())
print("Green - Description")
print(green['Green'].describe())
print("Plaid Cymru - Description")
print(pc['PC'].describe())
print("Independent - Description")
print(ind['Ind'].describe())
Frankly, it's clear that the Conservatives, Independents and Reform UK were unpopular and often won by a very small margin, compared to the Liberal Democrats, Green, Plaid Cymru and Labour, who hover just above the 50% mark: not quite a "clear majority", but more support than the Conservatives had. However, Independent, Plaid, Green and Reform make up a small portion of the data set, so these will likely have very poor fits.
So let's build our ML model then.
Just thinking: I said there are 575 constituencies, 32 in Wales and 543 in England. If we do a 60-20-20 split, an ideal random split would give Wales a constituency make-up like 20-6-6 or 19-7-6, or some similar permutation. This means all of the PC wins (a total of 4) could end up in one split, which would be bad for the other two splits. The same goes for Green and Reform, which also have a small pool of data to pull from. It may be wise to group these smaller parties into an 'Other' category.
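One way to guard against that worry: `train_test_split` accepts a `stratify=` argument that keeps each class's share roughly equal across the splits. A quick sketch with a made-up label vector standing in for the real 'Elected' column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical label vector: 4 'PC' wins among mostly 'Lab'/'Con' seats.
labels = np.array(['Lab'] * 40 + ['Con'] * 16 + ['PC'] * 4)
X_demo = np.arange(len(labels)).reshape(-1, 1)

# stratify keeps each party's share equal in both splits, so the
# 4 PC seats cannot all fall into one side by chance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.25, stratify=labels, random_state=42)

print(sorted(y_te).count('PC'))  # 1 of the 4 PC seats lands in the 25% test split
```

For a three-way 60-20-20 split, the same idea applies twice: stratify the first split, then stratify again when carving validation out of the remainder.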
Despite having my joined Census-Election dataframe, I'm going to go back to my separated tables, as those basically give me my X and Y.
X1 = census_housing_pcon_df.copy()
X2 = census_eeo_pcon_df.copy()
X3 = census_spirh_pcon_df.copy()
X4 = census_natpop_pcon_df.copy()
# Merge them together
X = X1.merge(X2, on='PCON25CD', how='inner', copy=False)
X = X.merge(X3, on='PCON25CD', how='inner', copy=False)
X = X.merge(X4, on='PCON25CD', how='inner', copy=False)
X_index = X["PCON25CD"] # Index list (just in case)
X.drop(columns=['PCON25CD'], inplace=True) # drop useless cols (axis is implied by columns=)
X.head()
Y = GE2024_frac_df.copy()
#other_cols = ['PC', 'RUK', 'Green','Ind','LD']
#Y['Other'] = Y[other_cols].sum(axis=1)
#Y.drop(columns=other_cols, inplace=True)
Y.drop(columns=["PCON25NM"], inplace=True)
#Y['Elected'].replace({'LD':'Other','PC':'Other','RUK':'Other','Green':'Other','Ind':'Other'}, inplace=True)
Y.head()
This caused more annoyance than it should have: I couldn't work out what party "Spk" was, until I realised it was the Speaker. So drop row 137 to bin him off.
# Get rid of the speaker
X = X.drop(137)
Y = Y.drop(137)
# Calculate the proportion of votes that labour got
lab_count = Y['Elected'].value_counts().get('Lab', 0)
total_count = Y['Elected'].count()
lab_percentage = (lab_count / total_count) * 100
print("Labour obtained: %.2f%% of the constituencies" % lab_percentage)
As we can see, Labour hold a lot more than 25% of the seats, so our real null hypothesis is "the model is no better than always guessing Labour", which has an accuracy of approximately 65%.
Our model needs to be accurate more than 65% of the time before we can say it's actually doing anything.
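That "always guess Labour" null model can be made explicit with scikit-learn's `DummyClassifier`, which gives a concrete baseline number to beat. A sketch using a synthetic label vector with roughly the 65% Labour share estimated above (the distribution here is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical label distribution: ~65% Labour, as estimated above.
rng = np.random.default_rng(42)
y = rng.choice(['Lab', 'Con', 'LD'], size=200, p=[0.65, 0.25, 0.10])
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# Always predict the most common class, i.e. "always guess Labour".
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
base_acc = accuracy_score(y, baseline.predict(X))
print(round(base_acc, 2))  # roughly 0.65: the bar any real model has to clear
```

Running the same two lines on the real `X` and `Y2` would give the exact baseline for this data set rather than my hand-waved 65%.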
I did initially group these into 'Other', but we can't do that for the regressors.
Now, I can't decide if the 'Elected' column or the percentage columns are the better target, so I'll just separate these into two different ones.
Also, let's delete the Scottish and Northern Irish data.
Y = Y[~Y["PCON25CD"].str.startswith("S")]
Y = Y[~Y["PCON25CD"].str.startswith("N")]
Y_index = Y["PCON25CD"]
Y1 = Y.copy().drop(columns=['Elected','PCON25CD'])
Y2 = Y.copy().loc[:, 'Elected']
Y1.head()
Y2.head()
We need to get SMOTE in here, as the seat count for any other party (and the Independents) is tiny compared to Labour. So we'll oversample to get some more data!
party_to_num_map = {'Lab': 0, 'Con': 1, 'LD': 2, 'Green': 3, 'RUK': 4, 'PC': 5, 'Ind': 6}
num_to_party_map = {v: k for k, v in party_to_num_map.items()}
Y2 = Y2.replace(party_to_num_map)
#Y2 = pd.to_numeric(Y2, errors='coerce')
# Plot a count plot
plt.figure(figsize=(10, 6))
sns.countplot(x=Y2.replace(num_to_party_map))
plt.title('Count of Each Party in Y2')
plt.xlabel('Party')
plt.ylabel('Count')
plt.show()
sm = SMOTE(random_state=random_code, k_neighbors=3)
X_res, Y2_res = sm.fit_resample(X, Y2)
# Plot a count plot
plt.figure(figsize=(10, 6))
sns.countplot(x=Y2_res)
plt.title('Count of Each Party in Y2_res')
plt.xlabel('Party')
plt.ylabel('Count')
plt.show()
"""kbins = KBinsDiscretizer(n_bins=7, encode='ordinal', strategy='uniform')
Y1_binned = kbins.fit_transform(Y1.to_numpy().reshape(-1, 1)).ravel()
print(Y1_binned)"""
"""smote = SMOTE(random_state=42)
X_res, Y1_res = smote.fit_resample(X, Y1)
print(Y1)"""
x1_train, x1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.2, random_state=random_code)
x2_train, x2_test, y2_train, y2_test = train_test_split(X_res, Y2_res, test_size=0.2, random_state=random_code)
Okay, so this is our final total in the test data: technically we have to get 62% or better, but we should aim for 65%. We can go into significance later. Let's just aim for better than 65% for now.
param_grid = {
'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [None, 5, 10, 20, 30, 40, 50],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': [None, 'sqrt', 'log2'] # 'auto' was removed in newer scikit-learn versions
}
# Define an initial tree
tree = DecisionTreeClassifier(random_state = random_code)
# Perform Grid Search
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(x2_train, y2_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
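A small shortcut worth knowing: with the default `refit=True`, `GridSearchCV` already retrains the winning model on the full training data, so `grid_search.best_estimator_` can be used directly instead of rebuilding a tree from `best_params`. A sketch on toy data (the `make_classification` frame and tiny grid are stand-ins for the real search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=100, random_state=42)
gs = GridSearchCV(DecisionTreeClassifier(random_state=42),
                  {'max_depth': [2, 4]}, cv=3)
gs.fit(X_demo, y_demo)

# refit=True (the default) means the winner is already trained on all of X_demo
best = gs.best_estimator_
print(best.max_depth)
```

Rebuilding from `best_params` as done below works too and gives the same model; `best_estimator_` just saves one fit.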
# Build a new tree based on the best parameters
tree_best = DecisionTreeClassifier(**best_params, random_state=random_code)
tree_best.fit(x2_train, y2_train)
tree_best.score(x2_train,y2_train)
# Define the feature importances and sort them in descending order
importances = tree_best.feature_importances_
indices = np.argsort(importances)[::-1]
"""print("Feature ranking:")
for f in range(x2_train.shape[1]):
print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]})")"""
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Calculate the importances and standard deviations for the top features
plot_importances = tree_best.feature_importances_[top_features_indices]
# Create a Series for the top feature importances
tree_importances = pd.Series(plot_importances, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
tree_importances.plot.barh(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Select the top features from the DataFrame
x2_train_top = x2_train[top_features]
x2_test_top = x2_test[top_features]
# Train the model with top features
tree_top = DecisionTreeClassifier(**best_params, random_state=random_code)
tree = tree_top.fit(x2_train_top, y2_train)
tree_prediction = tree.predict(x2_test_top)
print("Accuracy:", round(accuracy_score(y2_test,tree_prediction), 3))
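Since overall accuracy hides how each party fares, a per-class breakdown is worth a look: `classification_report` and `confusion_matrix` give precision and recall per party. A sketch with made-up encoded labels standing in for `y2_test` and `tree_prediction` (0/1/2 here are hypothetical party codes):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true/predicted party codes (0=Lab, 1=Con, 2=LD), standing in
# for y2_test and tree_prediction above.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=['Lab', 'Con', 'LD']))
```

On the real predictions, a low recall for a small party would show that SMOTE's balancing didn't fully save it.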
"""plt.figure(figsize = [60.0, 20.0])
class_names = [str(cls) for cls in tree.classes_]
_ = plot_tree(tree,
feature_names = x2_test.columns,
class_names = class_names,
filled = True,
rounded = False,
proportion = True,
fontsize = 11)
for o in _:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('red')
arrow.set_linewidth(3)"""
Now let's try a random forest. We'll start with some initial values and do our model optimisation later.
# Init the param grid and the model
param_grid = {
'n_estimators': [100, 200, 300],
'max_features': ['sqrt', 'log2', None], # 'auto' was removed in newer scikit-learn versions
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
rfc_ = RandomForestClassifier(random_state = random_code)
# Grid search to find the best parameters
grid_search = GridSearchCV(estimator=rfc_, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(x2_train, y2_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
# Generate new model and predict based on the best parameters
rfc_best = RandomForestClassifier(**best_params, random_state=random_code)
rfc = rfc_best.fit(x2_train,y2_train)
# Get the feature importances and sort them in descending order
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
"""print("Feature ranking:")
for f in range(x2_train.shape[1]):
print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]})")"""
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Calculate the importances and standard deviations for the top features
plot_importances = rfc.feature_importances_[top_features_indices]
std = np.std([tree.feature_importances_[top_features_indices] for tree in rfc.estimators_], axis=0)
# Create a Series for the top feature importances
forest_importances = pd.Series(plot_importances, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
forest_importances.plot.barh(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()
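`permutation_importance` is imported at the top but never used; it's a useful cross-check here, because MDI importances can be unreliable when features are as correlated as these census percentages. A sketch on synthetic data where, by construction, only the first feature matters (everything here is made up for illustration; the real call would use `rfc` and the test split):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 5))
# Labels depend almost entirely on feature 0.
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Shuffle each feature in turn and measure the accuracy drop.
result = permutation_importance(clf, X_demo, y_demo, n_repeats=10, random_state=42)
print(result.importances_mean.argmax())  # feature 0 should dominate
```

Unlike MDI, this is computed on held-out predictions, so it can also be run on the test split to see which census variables the model actually relies on.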
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Select the top features from the DataFrame
x2_train_top = x2_train[top_features]
x2_test_top = x2_test[top_features]
# Train the model with top features
rfc_top = RandomForestClassifier(**best_params, random_state=random_code)
rfc = rfc_top.fit(x2_train_top, y2_train)
# Cross validate the accuracy
accuracyRFC_train = cross_val_score(estimator=rfc, X=x2_train_top, y=y2_train, cv=5)
# Note: cross-validating on the test set refits the model on test folds, so treat this as a rough check only
accuracyRFC_test = cross_val_score(estimator=rfc, X=x2_test_top, y=y2_test, cv=5)
print("The CV Score for the training data is %.3f%%, and for the test data it is %.3f%%" % (accuracyRFC_train.mean() * 100, accuracyRFC_test.mean() * 100))
# Calculate the predictive accuracy
rfc_predictions = rfc.predict(x2_test_top)
print("Accuracy:", round(accuracy_score(y2_test,rfc_predictions), 3))
----- UN-USED -----
"""param_grid = {
'n_estimators': [50, 100, 150],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth': [None, 3, 5, 10, 20],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False]
}
rfr_ = RandomForestRegressor(random_state = random_code)
grid_search = GridSearchCV(estimator=rfr_, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(x1_train, y1_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
rfr_best_search = RandomForestRegressor(**best_params, random_state=random_code)
rfr_best_search.fit(x1_train, y1_train)
search_score = rfr_best_search.score(x1_train,y1_train)
print(search_score)
rfr_ = RandomForestRegressor(random_state = random_code)
grid_search = RandomizedSearchCV(estimator=rfr_, param_distributions=param_grid, cv=3, n_jobs=-1, n_iter=20)
grid_search.fit(x1_train, y1_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
rfr_best_random = RandomForestRegressor(**best_params, random_state=random_code)
rfr_best_random.fit(x1_train, y1_train)
random_score = rfr_best_random.score(x1_train,y1_train)
print(random_score)
if random_score > search_score:
rfr_best = rfr_best_random
else:
rfr_best = rfr_best_search
importances = rfr_best.feature_importances_
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(min(10, x1_train.shape[1])): # Ensure we do not exceed the number of features
print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]})")
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x1_train.columns[top_features_indices]
# Calculate the importances and standard deviations for the top features
plot_importances = rfr_best.feature_importances_[top_features_indices]
std = np.std([tree.feature_importances_[top_features_indices] for tree in rfr_best.estimators_], axis=0)
# Create a Series for the top feature importances
forest_importances = pd.Series(plot_importances, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x1_train.columns[top_features_indices]
# Select the top features from the DataFrame
x1_train_top = x1_train[top_features]
x1_test_top = x1_test[top_features]
# Train the model with top features
rfr_top = RandomForestRegressor(**best_params, random_state=random_code)
rfr = rfr_top.fit(x1_train_top, y1_train)
print(rfr.score(x1_train_top, y1_train))
rfr_predictions = rfr.predict(x1_test_top)
rfr_predictions"""
"""columns = y1_test.columns
for idx, col in enumerate(columns):
plt.plot(y1_test[col], rfr_predictions[:, columns.get_loc(col)], 'o', label=col)
# Calculate regression metrics
mse = mean_squared_error(y1_test[col], rfr_predictions[:,idx])
mae = mean_absolute_error(y1_test[col], rfr_predictions[:,idx])
r2 = r2_score(y1_test[col], rfr_predictions[:,idx])
print(col,": Mean Squared Error (MSE):", round(mse, 2))
print(col,": Mean Absolute Error (MAE):", round(mae, 2))
print(col,": R² Score:", round(r2, 2))
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Random Forest Regressor Predictions')
plt.legend()"""
"""columns = y1_test.columns
colours = ['b','r','orange','cyan','olivedrab','springgreen','pink']
plt.figure(figsize=(20,14))
for idx, col in enumerate(columns):
ax = sns.regplot(x=y1_test[col], y=rfr_predictions[:,idx], ci=100, line_kws={'color':'red'},label=col, color=colours[idx])
mean_prediction = rfr_predictions[:, idx].mean()
mean_test = y1_test[col].mean()
ax.scatter(mean_test,mean_prediction, color='k',s=300, marker='o')
ax.scatter(mean_test,mean_prediction, color=colours[idx],s=200, marker='o')
# Calculate regression metrics
mse = mean_squared_error(y1_test[col], rfr_predictions[:,idx])
mae = mean_absolute_error(y1_test[col], rfr_predictions[:,idx])
r2 = r2_score(y1_test[col], rfr_predictions[:,idx])
print(col,": Mean Squared Error (MSE):", round(mse, 2))
print(col,": Mean Absolute Error (MAE):", round(mae, 2))
print(col,": R² Score:", round(r2, 2))
ax.legend()
ax.set(xlabel='Actual Values', ylabel='Predicted Values')
ax.set_title('Random Forest Regressor Predictions')
rfr_predictions_trans = []
for row in rfr_predictions:
rfr_predictions_trans.append(columns[np.argmax(row)])
print(rfr_predictions_trans)
print("Accuracy:", round(accuracy_score(y2_test,rfr_predictions_trans), 3))"""
Next, we'll try K-Nearest Neighbours. Let's start with the classifier.
# Init the param grid and the model
param_grid = {
'n_neighbors': [3, 5, 7, 10, 15, 20],
'weights': ['uniform', 'distance'],
'algorithm': ["auto", "ball_tree", "kd_tree", "brute"],
'leaf_size': [10, 20, 30, 40, 50,],
'p': [1, 2, 3]
}
knnc = KNeighborsClassifier()
# Grid search for best parameters
grid_search = GridSearchCV(estimator=knnc, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(x2_train, y2_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
# Generate a new model based on the best parameters
knnc_best = KNeighborsClassifier(**best_params)
knnc_best.fit(x2_train, y2_train)
importances = permutation_importance(knnc_best, x2_test, y2_test, random_state=random_code)
indices = importances.importances_mean.argsort()[::-1]
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
top_features_importance = importances.importances_mean[top_features_indices]
# Create a Series for the top feature importances
knnc_importances = pd.Series(top_features_importance, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
knnc_importances.plot.barh(ax=ax)
ax.set_title("Feature importances via permutation on the test set")
ax.set_xlabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
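A quick aside on why `permutation_importance` is used here at all: unlike the random forest, KNN has no built-in `feature_importances_` attribute, so instead each feature is shuffled in turn and the resulting drop in score is measured. A minimal sketch on made-up data (the dataset and parameters below are illustrative, not the notebook's):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Made-up data: only the first 2 of 6 features are informative
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in test accuracy
result = permutation_importance(knn, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))
# The informative features (columns 0 and 1) should show the largest drops
```

A feature whose shuffled score barely moves is one the model is not really using, which is exactly the signal the top-10 selection below relies on.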
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Select the top features from the DataFrame
x2_train_top = x2_train[top_features]
x2_test_top = x2_test[top_features]
# Train the model with top features
knnc_top = KNeighborsClassifier(**best_params)
knnc = knnc_top.fit(x2_train_top, y2_train)
# Cross validate the accuracy
accuracyKNNC_train = cross_val_score(estimator=knnc, X = x2_train_top, y = y2_train, cv=5)
accuracyKNNC_test = cross_val_score(estimator=knnc, X = x2_test_top, y = y2_test, cv=5)
print("The CV Score for the training data is %.3f%%, and for the test data it is %.3f%%" %(accuracyKNNC_train.mean()*100, accuracyKNNC_test.mean()*100))
# Calculate the predictive accuracy
knnc_predictions = knnc.predict(x2_test_top)
print("Accuracy:", round(accuracy_score(y2_test,knnc_predictions), 3))
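One parameter from the grid worth unpacking is `weights`: 'uniform' gives each of the k neighbours an equal vote, while 'distance' weights each vote by inverse distance, so closer neighbours count for more. A toy sketch (the points below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: class 0 clustered near 0, class 1 clustered near 10
X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 1, 1, 1])

uniform = KNeighborsClassifier(n_neighbors=5, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X, y)

# At x=3 the five nearest neighbours are three 0s and two 1s, so both
# schemes agree on class 0; at x=8 the nearest points are mostly 1s,
# and inverse-distance weighting amplifies their influence further.
print(uniform.predict([[3.0]]), distance.predict([[3.0]]))   # [0] [0]
print(uniform.predict([[8.0]]), distance.predict([[8.0]]))   # [1] [1]
```

The `p` parameter in the grid plays a similar geometric role: it picks the Minkowski distance (p=1 Manhattan, p=2 Euclidean) used to find those neighbours in the first place.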
----- UN-USED -----
"""param_grid = {
'n_neighbors': [3, 5, 7, 10, 15, 20],
'weights': ['uniform', 'distance'],
'algorithm': ["auto", "ball_tree", "kd_tree", "brute"],
'leaf_size': [10, 20, 30, 40, 50,],
'p': [1, 2, 3]
}
knnr = KNeighborsRegressor()
grid_search = GridSearchCV(estimator=knnr, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(x1_train, y1_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
knnr_best = KNeighborsRegressor(**best_params)
knnr_best.fit(x1_train, y1_train)
knnr_best.score(x1_train,y1_train)
predictions = knnr_best.predict(x1_train)
knnr_best.score(x1_test,y1_test)
accuracyRFM = cross_val_score(estimator=knnr_best, X = x1_train, y = y1_train, cv=5)
accuracyRFM.mean()
knnr_predictions = knnr_best.predict(x1_test)
knnr_predictions
columns = y1_test.columns
colours = ['b','r','orange','cyan','olivedrab','springgreen','pink']
plt.figure(figsize=(20,14))
for idx, col in enumerate(columns):
ax = sns.regplot(x=y1_test[col], y=knnr_predictions[:,idx], ci=50, line_kws={'color':'purple'},label=col, color=colours[idx])
mean_prediction = knnr_predictions[:, idx].mean()
mean_test = y1_test[col].mean()
ax.scatter(mean_test,mean_prediction, color='k',s=300, marker='o')
ax.scatter(mean_test,mean_prediction, color=colours[idx],s=200, marker='o')
# Calculate regression metrics
mse = mean_squared_error(y1_test[col], knnr_predictions[:,idx])
mae = mean_absolute_error(y1_test[col], knnr_predictions[:,idx])
r2 = r2_score(y1_test[col], knnr_predictions[:,idx])
print(col,": Mean Squared Error (MSE):", round(mse, 2))
print(col,": Mean Absolute Error (MAE):", round(mae, 2))
print(col,": R² Score:", round(r2, 2))
ax.legend()
ax.set(xlabel='Actual Values', ylabel='Predicted Values')
ax.set_title('K Nearest Neighbour Predictions')
knnr_predictions_trans = []
for row in knnr_predictions:
knnr_predictions_trans.append(columns[np.argmax(row)])
print(knnr_predictions_trans)"""
gnb = GaussianNB() # Initial NB
# Search for best parameters
grid_search = GridSearchCV(estimator=gnb, param_grid={'var_smoothing': np.logspace(0, -20, num=200)}, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(x2_train, y2_train)
#grid_search.fit(x2_train_scaled, y2_train.ravel())
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
# Get a new NB based on the best parameters
gnb_best = GaussianNB(**best_params)
gnb_best.fit(x2_train, y2_train)
"""importances = permutation_importance(gnb_best, x2_test, y2_test)
indices = importances.importances_mean.argsort()[::-1]
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
top_features_importance = importances.importances_mean[top_features_indices]
# Create a Series for the top feature importances
gnb_importances = pd.Series(top_features_importance, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
gnb_importances.plot.barh(ax=ax)
ax.set_title("Feature importances via permutation on the test set")
ax.set_xlabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Select the top features from the DataFrame
x2_train_top = x2_train[top_features]
x2_test_top = x2_test[top_features]
# Train the model with top features
gnb_top = GaussianNB(**best_params)
gnb = gnb_top.fit(x2_train_top, y2_train)
""" # Accuracy: 0.492 - With Feature Matching Above
# Cross validate the score.
accuracyGNB_train = cross_val_score(estimator=gnb_best, X = x2_train, y = y2_train, cv=5)
accuracyGNB_test = cross_val_score(estimator=gnb_best, X = x2_test, y = y2_test, cv=5)
print("The CV Score for the training data is %.3f%%, and for the test data it is %.3f%%" %(accuracyGNB_train.mean()*100, accuracyGNB_test.mean()*100))
# Calculate the predictive accuracy
gnb_predictions = gnb_best.predict(x2_test)
print("Accuracy:", round(accuracy_score(y2_test,gnb_predictions), 3))
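As a note on what `var_smoothing` is actually tuning: GaussianNB fits one Gaussian per class per feature, and `var_smoothing` pads every variance with a fraction of the largest feature variance, so near-constant features don't produce degenerate likelihoods. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
# Two made-up classes: class 0 centred at 0, class 1 centred at 5
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB(var_smoothing=1e-9).fit(X, y)

# Each class is scored with per-feature Gaussian likelihoods; the padding
# added to every variance is var_smoothing times the largest feature variance
print(gnb.predict([[0.0, 0.0], [5.0, 5.0]]))   # [0 1]
proba = gnb.predict_proba([[2.5, 2.5]])
print(proba.sum())  # each row of predict_proba is normalised to sum to 1
```

That normalised `predict_proba` output is also what lets GaussianNB participate in the soft-voting ensemble at the end.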
# Init the param grid and the model
param_grid = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto']
# 'degree' is only used by the 'poly' kernel, so it is omitted here
}
svc = SVC(random_state=random_code, probability=True)
# Grid search for best parameters
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(x2_train, y2_train)
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)
# Generate a new model based on the best parameters
svc_best = SVC(**best_params, random_state=random_code, probability=True)
svc_best.fit(x2_train, y2_train)
"""importances = permutation_importance(svc_best, x2_test, y2_test)
indices = importances.importances_mean.argsort()[::-1]
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
top_features_importance = importances.importances_mean[top_features_indices]
# Create a Series for the top feature importances
svc_importances = pd.Series(top_features_importance, index=top_features)
# Plot the importances with error bars
fig, ax = plt.subplots()
svc_importances.plot.barh(ax=ax)
ax.set_title("Feature importances via permutation on the test set")
ax.set_xlabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
# Assuming 'importances' and 'indices' are already defined
top_features_indices = indices[:10] # Adjust the number of top features as needed
top_features = x2_train.columns[top_features_indices]
# Select the top features from the DataFrame
x2_train_top = x2_train[top_features]
x2_test_top = x2_test[top_features]
# Train the model with top features
svc_top = SVC(**best_params, random_state=random_code, probability=True)
svc = svc_top.fit(x2_train_top, y2_train)
""" # Accuracy: 0.609 - With Feature Matching Above
# Cross validate the accuracy
accuracySVC_train = cross_val_score(estimator=svc_best, X = x2_train, y = y2_train, cv=5)
accuracySVC_test = cross_val_score(estimator=svc_best, X = x2_test, y = y2_test, cv=5)
print("The CV Score for the training data is %.3f%%, and for the test data it is %.3f%%" %(accuracySVC_train.mean()*100, accuracySVC_test.mean()*100))
# Calculate the predictive accuracy
svc_predictions = svc_best.predict(x2_test)
print("Accuracy:", round(accuracy_score(y2_test,svc_predictions), 3))
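A caveat worth flagging for both SVC and KNN: they are distance-based, so a feature with a much larger numeric range can dominate the kernel unless the inputs are standardised (this is what the StandardScaler import at the top is for). A small illustration on synthetic data, where one noise feature is inflated by a factor of 1000:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic data: the first 2 features are informative, the rest are noise
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)
X[:, -1] *= 1000  # inflate one *noise* feature's scale

raw = SVC(kernel='rbf', random_state=42)
scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=42))

raw_score = cross_val_score(raw, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled, X, y, cv=5).mean()
print(f"unscaled: {raw_score:.3f}  scaled: {scaled_score:.3f}")
# the scaled pipeline should win comfortably: the inflated noise feature
# swamps the RBF distance computation in the unscaled model
```

The census percentages are all on a 0-100 scale, so this matters less here than in the general case, but it is the reason scaling belongs in the pre-processing step.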
Finally, let's combine all the classifiers into a voting ensemble and see what happens.
#ensemble_model = VotingClassifier(estimators=[('rfr', rfr), ('knnr', knnr_best)], voting='soft')
ensemble_model = VotingClassifier(estimators=[('rfc', rfc), ('tree', tree), ('knnc', knnc_best), ('gnb', gnb_best), ('svc', svc_best)], voting='soft')
ensemble_model.fit(x2_train, y2_train)
ensemble_predictions = ensemble_model.predict(x2_test)
print("Ensemble model score: ", ensemble_model.score(x2_test, y2_test))
print("Accuracy:", round(accuracy_score(y2_test,ensemble_predictions), 3))
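For reference, `voting='soft'` averages the classifiers' `predict_proba` outputs and takes the argmax, rather than counting hard class votes; this is also why the SVC above was built with `probability=True`. A toy sketch of the averaging, with invented probabilities for a single area over three hypothetical parties:

```python
import numpy as np

# Hypothetical predict_proba rows from three classifiers for one area,
# columns = three parties (numbers are made up for illustration)
probas = np.array([
    [0.5, 0.3, 0.2],   # classifier A backs party 0
    [0.2, 0.6, 0.2],   # classifier B backs party 1
    [0.4, 0.4, 0.2],   # classifier C is torn between 0 and 1
])

soft = probas.mean(axis=0)                           # soft voting: average probabilities
hard = np.bincount(probas.argmax(axis=1)).argmax()   # hard voting: majority of argmaxes

print("soft winner:", soft.argmax())   # party 1
print("hard winner:", hard)            # party 0: the two schemes can disagree
```

Soft voting lets a confident minority outvote a lukewarm majority, which is usually why it edges out hard voting when the base classifiers produce well-calibrated probabilities.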
mapped_ensemble_predictions = [num_to_party_map[letter] for letter in ensemble_predictions]
mapped_y2_test = [num_to_party_map[letter] for letter in y2_test]
result = pd.DataFrame({'Truth': mapped_y2_test, 'Prediction': mapped_ensemble_predictions}, columns=['Truth', 'Prediction'])
result['Match'] = result['Truth'] == result['Prediction']
result.to_csv('CensusElection_PredictorV2.csv', index=False)
So what are my thoughts: